[Quantization] Consolidate experts_int8 with fp8 online quantization#38463

Open
Josephasafg wants to merge 8 commits into vllm-project:main from Josephasafg:experts_int8_consolidation

Conversation

Contributor

@Josephasafg commented Mar 29, 2026

Purpose

Following up on #38032, this PR consolidates experts_int8 with fp8's online quantization infrastructure (QeRL). It extracts shared online MoE quantization logic into a common base class and refactors fp8's MoE kernel infrastructure into a reusable mixin.
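The base-class-plus-mixin shape described above can be sketched as follows. This is a minimal illustration of the pattern, not vLLM's actual API: the class names `OnlineMoEMethodBase`, `Fp8MoEKernelMixin`, `ExpertsInt8MoEMethod`, and `Fp8OnlineMoEMethod` come from the PR description, but every method body here is a hypothetical stand-in for the real weight-allocation and quantization logic.

```python
class OnlineMoEMethodBase:
    """Shared online-quantization flow: allocate placeholder weights,
    then quantize after the checkpoint has been loaded (sketch)."""

    def create_weights(self, layer):
        # The real code allocates weights on the meta device; this
        # sketch only records that loading has not happened yet.
        layer.weights_loaded = False

    def process_weights_after_loading(self, layer):
        self._quantize(layer)  # deferred, subclass-specific quantization
        layer.weights_loaded = True

    def _quantize(self, layer):
        raise NotImplementedError


class Fp8MoEKernelMixin:
    """Reusable fp8 MoE kernel selection (hypothetical stub)."""

    def select_kernel(self):
        return "fp8_moe_kernel"


class ExpertsInt8MoEMethod(OnlineMoEMethodBase):
    def _quantize(self, layer):
        layer.quant_dtype = "int8"


class Fp8OnlineMoEMethod(Fp8MoEKernelMixin, OnlineMoEMethodBase):
    def _quantize(self, layer):
        layer.quant_dtype = "fp8"
```

The point of the split is that both quantization methods inherit the same load-then-quantize lifecycle from the base class, while only the fp8 path pulls in the kernel mixin.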

Test Plan

The existing experts_int8 and fp8 tests should pass.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test commands.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Josephasafg <ajgard7@gmail.com>

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the online MoE quantization infrastructure by introducing a common base class, OnlineMoEMethodBase, and a mixin, Fp8MoEKernelMixin, to share logic between different quantization methods. It migrates ExpertsInt8MoEMethod and Fp8OnlineMoEMethod to this new architecture, which utilizes meta-device weight allocation and deferred quantization after model loading. Review feedback identifies potential division-by-zero issues in the int8 quantization loop when encountering zero-valued weight rows and highlights inefficient cross-device tensor allocations for scale parameters that should be created on the same device as the weights.
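The division-by-zero issue flagged in the review arises in symmetric per-row int8 quantization: the scale for a row is its absolute maximum divided by 127, so an all-zero weight row yields a zero scale, and dividing by it during quantization fails. A minimal torch-free sketch of the guard (the function name and plain-list representation are illustrative, not vLLM's implementation):

```python
def quantize_rows_int8(weight):
    """Symmetric per-row int8 quantization over a 2-D list of floats.

    Returns (quantized_rows, per_row_scales). An all-zero row would
    produce scale 0 and a ZeroDivisionError below, so it is clamped
    to a scale of 1.0, which maps the row to all-zero int8 values.
    """
    quantized, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0.0 else 1.0  # zero-row guard
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales
```

The review's second point, about scale parameters being allocated on a different device than the weights, does not show up in this list-based sketch; in tensor code the fix is simply to create the scale tensor with the weight tensor's device rather than the default one.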

Signed-off-by: Josephasafg <ajgard7@gmail.com>

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
@mergify mergify bot removed the needs-rebase label Mar 30, 2026
@Josephasafg Josephasafg marked this pull request as ready for review March 30, 2026 07:33

@claude bot left a comment


Claude Code Review

This pull request is from a fork, so automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.
